ABSTRACT

Matching dependencies (MDs) were introduced to specify 
the identification or matching of certain attribute values in 
pairs of database tuples when some similarity conditions are 
satisfied. Their enforcement can be seen as a natural gener- 
alization of entity resolution. In what we call the pure case 
of MDs, any value from the underlying data domain can be 
used for the value in common that does the matching. We 
investigate the semantics and properties of data cleaning 
through the enforcement of matching dependencies for the 
pure case. We characterize the intended clean instances and 
also the clean answers to queries as those that are invariant 
under the cleaning process. The complexity of computing 
clean instances and clean answers to queries is investigated. 
Tractable and intractable cases depending on the MDs and 
queries are identified. Finally, we establish connections with 
database repairs under integrity constraints. 

1. INTRODUCTION 

A database instance may contain several tuples and val- 
ues in them that refer to the same external entity that is 
being modeled through the database. In consequence, the 
database may be modeling the same entity in different forms, 
as different entities, which most likely is not the intended 
representation. This problem could be caused by errors in 
data, by data coming from different sources that use differ- 
ent formats or semantics, etc. In this case, the database 
is considered to contain dirty data, and it must undergo a 
cleansing process that goes through two interlinked phases: 
detecting tuples (or values therein) that should be matched 
or identified, and, of course, doing the actual matching. This 
problem is usually called entity resolution, data fusion, duplicate
record detection, etc. Cf. [13, 10] for some recent
surveys and [6] for recent work in this area. 

Quite recently, and generalizing entity resolution, [14, 15]
introduced matching dependencies (MDs), which are declar- 
ative specifications of matchings of attribute values that 
should hold under certain conditions. MDs help identify 
duplicate data and enforce their merging by exploiting se- 
mantic knowledge expressed. 

Loosely speaking, an MD is a rule defined on a database 
which states that, for any pair of tuples from given relations 
within the database, if the values of certain attributes of the 
tuples are similar, then the values of another set of attributes 
should be considered to represent the same object. In con- 
sequence, they should take the same values. Here, similarity 
of values can mean equality or a domain-dependent similar- 
ity relationship, e.g. related to some metric, such as the edit 
distance. 

Example 1. Consider the following database instance of a 
relation P. 



Name          Phone             Address
John Smith    723-9583          10-43 Oak St.
J. Smith      (750) 723-9583    43 Oak St. Ap. 10



Similarity of the names in the two tuples (as measured by, 
e.g. edit distance) is insufficient to establish that the tuples 
refer to the same person. This is because the last name is a 
common one, and only the first initial of one of the names is 
given. However, similarity of their phone and address values 
indicates that the two tuples may be duplicates. This is 
expressed by an MD which states that, if two tuples from 
P have similar address and phone, then the names should 
match. In the notation of MDs, this is expressed as 

$P[Phone] \approx P[Phone] \wedge P[Address] \approx P[Address] \rightarrow P[Name] \doteq P[Name]$. □

The identification in [14, 15] of a new class of dependencies 
and their declarative formulation have become important 
additions to data cleaning research. In this work we investi- 
gate matching dependencies, starting from and refining the 
model-theoretic and dynamic semantics of MDs introduced 
in [15]. 

Any method of querying a dirty data source must address 
the issue of duplicate detection in order to obtain accurate 
answers. Typically, this is done by first cleaning the data by 
discarding or combining duplicate tuples and standardizing 
formats. The result will be a new database where the entity 
conflicts have been resolved. However, the entity resolution 
problem may have different solution instances (which we will
simply call solutions), i.e. different clean versions of the
original database. The model-theoretic semantics that we 
propose and investigate defines and characterizes the class 
of solutions, i.e. of intended clean instances. 

After a clean instance has been obtained, it can be queried 
as usual. However, the query answers will then depend on 
the particular solution at hand. So, it becomes relevant to 
characterize those query answers that are invariant under 
the different (sensible) ways of cleaning the data, i.e. that 
persist across the solutions. This is an interesting problem 
per se. However, it becomes crucial if one wants to obtain 
semantically clean answers while still querying the original 
dirty data source. 

This kind of virtual cleaning and query answering on top 
of it have been investigated in the area of consistent query 
answering (CQA) [3], where, instead of MDs, classical in- 
tegrity constraints (ICs) are considered, and database in- 
stances are repaired in order to restore consistency (cf. [9, 
7, 11] for surveys of CQA). Virtual approaches to robust 
query answering under entity resolution and enforcement of 
matching dependencies are certainly unavoidable in virtual 
data integration systems. 

In this paper we make the following contributions, among 
others: 

1. We revisit the semantics of MDs introduced in [15], 
pointing out sensible and justified modifications of it. 
A new semantics for MD satisfaction is then proposed 
and formally developed. 

2. Using the new MD semantics, we formally define the 
intended solutions for a given, initial instance, $D_0$, that
may not satisfy a given set of MDs. They are called
minimally resolved instances (MRIs) and are obtained
through an iteration process that stepwise enforces the
satisfaction of MDs until a stable instance is reached.
The resulting instances minimally differ from $D_0$ in
terms of the number of changes of attribute values.

This semantics (and the whole paper) considers the 
pure case introduced in [15], in the sense that the values
that can be chosen to match attribute values are
arbitrarily taken from the underlying data domains.
No matching functions are considered, like in [6], for 
example (where entire tuples are merged, not individ- 
ual attribute values). 

3. We introduce the notion of resolved answers to a query
posed to $D_0$. They are the answers that are invariant
under the MRIs.

4. We investigate the computability and complexity of 
computing MRIs and resolved answers, identifying syn- 
tactic conditions on MDs and conjunctive queries un- 
der which the latter becomes tractable via query rewrit- 
ing. The rewritten queries are allowed to contain count- 
ing and transitive closure (recursion). 

5. We identify cases where computing (actually, deciding) 
resolved answers is coNP-complete. 

6. We establish a connection between MRIs and database 
repairs under functional dependencies as found in CQA. 
In the latter case, the repairs consider, as usual, a no- 
tion of minimality based on deletion of whole tuples 
and comparison under set inclusion. This reduction
allows us to profit from results in CQA, obtaining additional
(in)tractability results for MDs.

This paper is organized as follows. Section 2 presents basic 
concepts and notations needed in the rest of the paper. Sec- 
tion 3 identifies some problems with the MD semantics, and 
refines it to address them. It also introduces the resolved in- 
stances and resolved answers to a query. Section 4 considers 
the problems of computing resolved instances and resolved 
query answers. Section 5 identifies queries and sets of MDs 
for which computing resolved answers becomes tractable via 
query rewriting. Section 6 establishes the connection with 
CQA. Section 7 presents some final conclusions. 

2. PRELIMINARIES 

In general terms, we consider a relational schema $\mathcal{S}$ that
includes an enumerable infinite domain U. An instance D
of $\mathcal{S}$ can be seen as a finite set of ground atoms of the form
$R(\bar{t})$, where R is a database predicate in $\mathcal{S}$, and $\bar{t}$ is a tuple of
constants from U. We assume that each database tuple has
an identifier, e.g. an extra attribute that acts as a key for
the relation and is not subject to updates. In the following
it will not be listed, unless necessary, as one of the attributes
of a database predicate. It plays an auxiliary role only, to
keep track of updates on the other attributes. R(D) denotes
the extension of R in D. We sometimes refer to attribute
A of R by R[A]. If the ith attribute of predicate R is A,
for a tuple $t = (c_1, \ldots, c_n) \in R(D)$, t[A] denotes the value
$c_i$. The symbol $t[\bar{A}]$ denotes the vector whose entries are
the values of the attributes in the vector $\bar{A}$. The attributes
may have subdomains that are contained in U. Constants
will be denoted by lower case letters at the beginning of the
alphabet.

A matching dependency [14], involving predicates
$R(A_1, \ldots, A_n)$, $S(B_1, \ldots, B_m)$, is a rule of the form

$\bigwedge_{i \in I, j \in J} R[A_i] \approx_{ij} S[B_j] \rightarrow \bigwedge_{i \in I', j \in J'} R[A_i] \doteq S[B_j]$. (1)

Here R and S could be the same predicate. I, I' and J, J' are
fixed subsets of {1, ..., n} and {1, ..., m}, resp. We assume
that, when $A_i$, $B_j$ are related via $\approx_{ij}$ or $\doteq$ in (1), they share
the same (sub)domain, so their values can be compared by
the domain-dependent binary similarity predicate $\approx_{ij}$, or
can be identified, resp.

The similarity operators, generically denoted with $\approx$, are
assumed to have the properties of: (a) Symmetry: If $x \approx y$,
then $y \approx x$. (b) Equality subsumption: If x = y, then $x \approx y$.

The MD in (1) is implicitly universally quantified in front
and applied to pairs of tuples $t_1$, $t_2$ for R and S, resp. The
expression $\bigwedge R[A_i] \approx_{ij} S[B_j]$ states that the values of the
attributes $A_i$ in tuple $t_1$ are similar to those of attributes
$B_j$ in tuple $t_2$. If this holds, the expression $R[A_i] \doteq S[B_j]$
indicates that, for the same tuples $t_1$ and $t_2$, $t_1[A_i]$ and $t_2[B_j]$
on the RHS should be updated so that they become the
same, i.e. their values are identified or matched. However,
the attribute values to be used for this matching are left
unspecified by (1).

For abbreviation, we will sometimes write MDs as

$R[\bar{A}] \approx S[\bar{B}] \rightarrow R[\bar{C}] \doteq S[\bar{E}]$, (2)

where $\bar{A}$, $\bar{B}$, $\bar{C}$, and $\bar{E}$ represent the lists of attributes
$(A_1, \ldots, A_k)$, $(B_1, \ldots, B_k)$, $(C_1, \ldots, C_{k'})$, and $(E_1, \ldots, E_{k'})$,
respectively. We refer to the pairs of attributes $(A_i, B_i)$ and
$(C_i, E_i)$ as corresponding pairs of attributes of the pairs
$(\bar{A}, \bar{B})$ and $(\bar{C}, \bar{E})$, respectively. For an instance D and a
pair of tuples $t_1 \in R(D)$ and $t_2 \in S(D)$, $t_1[\bar{A}] \approx t_2[\bar{B}]$ indicates
that the similarities of the values for all corresponding
pairs of attributes of $(\bar{A}, \bar{B})$ hold. Similarly, $t_1[\bar{C}] = t_2[\bar{E}]$
denotes the equality of the values of all pairs of corresponding
attributes of $(\bar{C}, \bar{E})$.

Since an MD involves an update operation, the MD is a 
condition that is satisfied by a pair of database instances: 
an instance D and its updated instance D' . 

Definition 1. [15] Let D, D' be instances of schema $\mathcal{S}$ with
predicates R and S, such that, for each tuple t in D, there
is a unique tuple t' in D' with the same identifier as t, and
vice versa. The pair (D, D') satisfies the MD m in (2), denoted
$(D, D') \models_F m$, iff, for every pair of tuples $t_R \in R(D)$
and $t_S \in S(D)$, if $t_R$ and $t_S$ satisfy $t_R[\bar{A}] \approx t_S[\bar{B}]$, then for
the corresponding tuples $t'_R$ and $t'_S$ in R(D'), S(D'), resp.,
it holds: (a) $t'_R[\bar{C}] = t'_S[\bar{E}]$, and (b) $t'_R[\bar{A}] \approx t'_S[\bar{B}]$. □

Intuitively, D' in Definition 1 is an instance obtained from
D by enforcing m on instance D. For a set M of MDs,
and a pair of instances (D, D'), $(D, D') \models_F M$ means that
$(D, D') \models_F m$, for every $m \in M$.

An instance D' is stable [15] for a set M of MDs if
$(D', D') \models_F M$. Stable instances correspond to the intuitive notion
of a clean database, in the sense that all the expected value
identifications already take place in it. Although not explicitly
developed in [15], for an instance D, if $(D, D') \models_F M$
for a stable instance D', then D' is expected to be reached
as a fix-point of an iteration of value identification updates
that starts from D and is based on M.

3. MD SEMANTICS REVISITED 

Condition (b) in Definition 1 is used to avoid that the 
identification updates destroy the original similarities. Un- 
fortunately, enforcing the requirement sometimes leads to 
counterintuitive results. 

Example 2. Consider the following instance D with string-valued
attributes, and MDs:

R   A   B   C          S   E     F
    a   c   g              h     c
    a   c   ksp            msp   c

$R[A] \approx R[A] \rightarrow R[C] \doteq R[C]$,  (3)
$R[C] \approx S[E] \rightarrow R[B] \doteq S[F]$.  (4)

For two strings $s_1$ and $s_2$, $s_1 \approx s_2$ if the edit distance d
between $s_1$ and $s_2$ satisfies $d \leq 1$. To produce an instance
D' satisfying $(D, D') \models_F M$, the strings g and ksp must be
changed to some common string s'.

Because of the similarities $h \approx g$ and $ksp \approx msp$, s' must
be similar to the E attribute values of the tuples in S, by
condition (b) of Definition 1 and MD (4). Clearly, there is
no s' that is similar to both h and msp. Therefore, at least
one of h and msp must be modified to some new value in
D'. □
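The similarity used in Example 2 can be checked mechanically. The following is a minimal sketch, assuming the standard Levenshtein edit distance; the function names `edit_distance` and `similar` are introduced here for illustration only and do not come from the paper. Since the edit distance satisfies the triangle inequality, a string s' within distance 1 of both h and msp would force the distance between h and msp to be at most 2.

```python
def edit_distance(s, t):
    """Levenshtein distance computed with the standard dynamic program."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            curr.append(min(prev[j] + 1,                # delete cs
                            curr[j - 1] + 1,            # insert ct
                            prev[j - 1] + (cs != ct)))  # substitute or match
        prev = curr
    return prev[-1]


def similar(s, t, threshold=1):
    """Similarity as in Example 2: edit distance at most `threshold` (assumed 1)."""
    return edit_distance(s, t) <= threshold
```

With this sketch, `similar("h", "g")` and `similar("ksp", "msp")` hold, while `edit_distance("h", "msp")` is 3, so no single string can be similar to both h and msp.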

Another problem with the semantics of MDs is that it allows
duplicate resolution in instances that are already resolved.
Intuitively, there is no reason to change the values in an instance
that is stable for a set of MDs M, because there is no
reason to believe, on the basis of M, that these values are in
error. However, even if an instance D satisfies $(D, D) \models_F M$,
it is always possible, by choosing different common values,
to produce a different instance D' such that $(D, D') \models_F M$.
This is illustrated in the next example.

Example 3. Let D be the instance below and the MD
$R[A] \approx R[A] \rightarrow R[B] \doteq R[B]$.

R   A   B
    a   c
    a   c

Although D is stable, $(D, D') \models_F m$ is true for any D' where
the B attribute values of the two tuples are the same. □

3.1 MD satisfaction 

We now propose a new semantics for MD satisfaction that 
disallows unjustified attribute modifications. We keep con- 
dition (a) of Definition 1, while replacing condition (b) with 
a restriction on the possible updates that can be made. 

Definition 2. Let D be an instance of schema $\mathcal{S}$, $R \in \mathcal{S}$,
$t_R \in R(D)$, C an attribute of R, and M a set of MDs.
Value $t_R[C]$ is modifiable if there exist $S \in \mathcal{S}$, $t_S \in S(D)$,
an $m \in M$ of the form $R[\bar{A}] \approx S[\bar{B}] \rightarrow R[\bar{C}] \doteq S[\bar{E}]$, and
a corresponding pair (C, E) of $(\bar{C}, \bar{E})$, such that one of the
following holds: 1. $t_R[\bar{A}] \approx t_S[\bar{B}]$, but $t_R[C] \neq t_S[E]$. 2.
$t_R[\bar{A}] \approx t_S[\bar{B}]$ and $t_S[E]$ is modifiable. □

Example 4. Consider two relations R and S with two MDs
defined on them:

R     A      B          S     C      E
t_0   a_0    b          t_3   a_3    c
t_1   a_1    b          t_4   a_4    c
t_2   a_2    b          t_5   a_5    c

$m_1 : R[A] \approx R[A] \rightarrow R[B] \doteq R[B]$,
$m_2 : R[A] \approx S[C] \rightarrow R[B] \doteq S[E]$.

The following similarities hold on the distinct constants of
R and S: $a_i \approx a_{(i+1) \bmod 6}$, $0 \leq i \leq 5$. The values $t_2[B]$
and $t_3[E]$ are modifiable by condition 1 of Definition 2, $m_2$,
$a_2 \approx a_3$, and $t_2[B] \neq t_3[E]$. For the same reason, $t_0[B]$ and
$t_5[E]$ are modifiable.

Value $t_1[B]$ is modifiable by condition 2 of Definition 2,
$m_1$, $a_1 \approx a_2$, and the fact that $t_2[B]$ is modifiable. Similarly,
$t_4[E]$ is modifiable. □

Definition 3. Let D, D' be instances for $\mathcal{S}$ with the same
tuple ids, M a set of MDs, and $m \in M$. (D, D') satisfies m,
denoted $(D, D') \models m$, iff:

1. For any pair of tuples $t_R \in R(D)$, $t_S \in S(D)$, if there
exists an MD in M of the form $R[\bar{A}] \approx S[\bar{B}] \rightarrow R[\bar{C}] \doteq S[\bar{E}]$
and $t_R[\bar{A}] \approx t_S[\bar{B}]$, then for the corresponding tuples
$t'_R \in R(D')$ and $t'_S \in S(D')$, it holds $t'_R[\bar{C}] = t'_S[\bar{E}]$.

2. For any tuple $t_R \in R(D)$ and any attribute G of R, if
$t_R[G]$ is not modifiable, then $t'_R[G] = t_R[G]$. □

Notice that the notion of satisfaction of an MD is relative
to a set of MDs to which the former belongs (due to the
modifiability condition). Of course, for a single MD m, we
can consider the set M = {m}. Condition 2 captures a
natural default condition of persistence of values: only those
values that have to be changed are changed.

The definition of satisfaction of a set M of MDs, $(D, D') \models M$,
is as usual. Also, as before, we define stable instance for
M to mean $(D, D) \models M$. Except where otherwise noted,
these are the notions of satisfaction and stability that we
will use in the rest of this paper.

Example 5. Consider again Example 4. The set of all D'
such that $(D, D') \models M$ is the set of all instances obtained
from D by changing all values of R[B] and S[E] to a com- 
mon value, and leaving all other values unchanged. This is 
because the values of R[B] and S[E] are the only modifi- 
able values, and these values must be equal by condition 1 
of Definition 3 and the given similarities. □ 

Condition 2 in Definition 3 on the set of updatable values 
does not prevent us from obtaining instances D' that enforce 
the MD, as the following theorem establishes. 

Theorem 1. For any instance D and set of MDs M, there
exists a D' such that $(D, D') \models M$. Moreover, for any attribute
value that is changed from D to D', the new value
can be chosen arbitrarily, as long as it is consistent with
$(D, D') \models M$. □

The new semantics introduced in Definition 3 solves the 
problems mentioned at the beginning of this section. No- 
tice that it does not require additional changes to preserve 
similarities (if the original ones were broken). Furthermore, 
modifications of instances, unless required by the enforce- 
ment of matchings as specified by the MDs, are not allowed. 
Also notice that the instance D' in Theorem 1 is not guaran- 
teed to be stable. We address this issue in the next section. 

Moreover, as can be seen from the proof of Theorem 1, 
the new restriction imposed by Definition 3 is as strong as 
possible in the following sense: Any definition of MD satisfaction
that includes condition 1 must allow the modification
of the modifiable attributes (according to Definition 2).
Otherwise, it is not possible to ensure, for arbitrary D, the
existence of an instance D' with $(D, D') \models M$.

3.2 Resolved instances 

According to the MD semantics in [15], although not explicitly
stated there, a clean version D' of an instance D is
an instance D' satisfying the conditions $(D, D') \models M$ and
$(D', D') \models M$. Due to the natural restrictions on updates
captured by the new semantics (cf. Definition 3), the existence
of such a D' is not guaranteed. Essentially, this is
because D' is the result of a series of updates. The MDs are
applied to the original instance D to produce a new instance,
which may have new pairs of similar values, forcing another
application of the MDs, which in their turn produces another
instance, and so on, until a stable instance D' is reached.
The pair (D, D') may not satisfy M. However, we will be
interested in those instances D' just mentioned. The idea is
to relax the condition $(D, D') \models M$, and obtain a stable D'
after an iterative process of MD enforcement, which at each
step, say k, makes sure that $(D_{k-1}, D_k) \models M$.

Definition 4. Let D be a database instance and M a set
of MDs. A resolved instance for D wrt M is an instance
D', such that there is a finite (possibly empty) sequence of
instances $D_1, D_2, \ldots, D_n$ with: $(D, D_1) \models M$, $(D_1, D_2) \models M$, ...,
$(D_{n-1}, D_n) \models M$, $(D_n, D') \models M$, and $(D', D') \models M$. □



Note that, by Definition 3, for an instance D satisfying
$(D, D) \models M$, it holds $(D, D') \models M$ if and only if D' = D.
In this case, the only possible set of intermediate instances 
is the empty set and D is the only resolved instance. Thus, 
a resolved instance cannot be obtained by making changes 
to an instance that is already resolved. 

Theorem 2. Given an instance D and a set M of MDs, 
there always exists a resolved instance of D with respect to 
M. □ 

Example 6. Consider the following instance D of a relation
R and set M of MDs:

R(D)   A   B   C
       a   b   d
       a   c   e
       a   b   e

$R[A] \approx R[A] \rightarrow R[B] \doteq R[B]$,
$R[B] \approx R[B] \rightarrow R[C] \doteq R[C]$.

All pairs of distinct constants in R are dissimilar. Two resolved
instances $D_1$ and $D_2$ of R are shown.

R(D_1)   A   B   C          R(D_2)   A   B   C
         a   b   d                   a   b   e
         a   b   d                   a   b   e
         a   b   d                   a   b   e

Notice that $(D, D_1) \not\models M$, because the value of the C attribute
of the second tuple is not modifiable in D. □
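Definitions 2 and 3 can be exercised on Example 6 with a short script. This is only a sketch under simplifying assumptions that are not part of the paper: there is a single relation, every MD has the form R[X] ≈ R[X] → R[Y] ≐ R[Y] and is encoded as the pair `(X, Y)`, an instance is a dictionary from tuple ids to rows, and the names `modifiable` and `satisfies` are hypothetical. Modifiability is computed as a least fixpoint, since condition 2 of Definition 2 is recursive.

```python
from itertools import product


def modifiable(inst, mds, sim):
    """Positions (tuple id, attribute) that are modifiable per Definition 2."""
    mod = set()
    changed = True
    while changed:  # iterate until no new position is added (least fixpoint)
        changed = False
        for (x, y), t, s in product(mds, inst, inst):
            if (t, y) in mod or not sim(inst[t][x], inst[s][x]):
                continue
            # condition 1: RHS values differ; condition 2: partner is modifiable
            if inst[t][y] != inst[s][y] or (s, y) in mod:
                mod.add((t, y))
                changed = True
    return mod


def satisfies(d, d_new, mds, sim):
    """Check (d, d_new) |= M per Definition 3 under the encoding above."""
    # Condition 1: similar LHS values in d force equal RHS values in d_new.
    for (x, y), t, s in product(mds, d, d):
        if sim(d[t][x], d[s][x]) and d_new[t][y] != d_new[s][y]:
            return False
    # Condition 2: values that are not modifiable in d must be unchanged.
    mod = modifiable(d, mds, sim)
    return all(d_new[t][a] == v or (t, a) in mod
               for t, row in d.items() for a, v in row.items())


# Example 6: distinct constants are dissimilar, so similarity is equality.
eq = lambda u, v: u == v
M = [("A", "B"), ("B", "C")]
D = {1: {"A": "a", "B": "b", "C": "d"},
     2: {"A": "a", "B": "c", "C": "e"},
     3: {"A": "a", "B": "b", "C": "e"}}
D1 = {i: {"A": "a", "B": "b", "C": "d"} for i in (1, 2, 3)}
D2 = {i: {"A": "a", "B": "b", "C": "e"} for i in (1, 2, 3)}
```

Under these assumptions, `satisfies(D, D1, M, eq)` is false while `satisfies(D, D2, M, eq)` and `satisfies(D2, D2, M, eq)` are true, so $D_2$ is resolved with an empty intermediate sequence. $D_1$ can still be reached in two steps through an intermediate instance that first equates the B values and the C values of the first and third tuples.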

The notion of resolved instance is one step towards the char- 
acterization of the intended clean instances. However, it still 
leaves room for refinement. Actually, the resolved instances 
that are of most interest for us are those that are somehow 
closest to the original instance. This consideration leads to 
the concept of minimal resolved instance, which uses as a 
measure of change the number of values that were modified 
to obtain the clean database. In Example 6, instance $D_2$ is
a minimal resolved instance, whereas $D_1$ is not.

Definition 5. Let D be an instance.

(a) $T_D$ := {(t, A) | t is the id of a tuple in D and A is an
attribute of the tuple}.

(b) $f_D : T_D \rightarrow U$ is given by: $f_D(t, A)$ := the value for A
in the tuple in D with id t.

(c) For an instance D' with the same tuple ids as D:
$S_{D,D'} := \{(t, A) \in T_D \mid f_D(t, A) \neq f_{D'}(t, A)\}$. □

Intuitively, $S_{D,D'}$ is the set of all values changed in going
from D to D'.

Definition 6. Let D be an instance and M a set of MDs.
A minimally resolved instance (MRI) of D wrt M is a resolved
instance D' such that $|S_{D,D'}|$ is minimum, i.e. there
is no resolved instance D'' with $|S_{D,D''}| < |S_{D,D'}|$. We denote
by Res(D, M) the set of minimal resolved instances of
D wrt the set M of MDs. □
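The change set of Definition 5 and the comparison in Definition 6 can be illustrated on the instances of Example 6. The sketch below assumes instances are dictionaries from tuple ids to rows, and that the candidate list already consists of resolved instances; it only compares the candidates it is given and does not compute Res(D, M). The names `changed_positions` and `minimal_instances` are hypothetical.

```python
def changed_positions(d, d_new):
    """S_{D,D'}: positions (tuple id, attribute) whose value differs."""
    return {(t, a) for t, row in d.items() for a, v in row.items()
            if d_new[t][a] != v}


def minimal_instances(d, candidates):
    """Among the given resolved instances, keep those with the fewest changes."""
    sizes = [len(changed_positions(d, c)) for c in candidates]
    return [c for c, k in zip(candidates, sizes) if k == min(sizes)]


# Instances of Example 6.
D = {1: {"A": "a", "B": "b", "C": "d"},
     2: {"A": "a", "B": "c", "C": "e"},
     3: {"A": "a", "B": "b", "C": "e"}}
D1 = {i: {"A": "a", "B": "b", "C": "d"} for i in (1, 2, 3)}
D2 = {i: {"A": "a", "B": "b", "C": "e"} for i in (1, 2, 3)}
```

Here $D_1$ differs from D in three positions and $D_2$ in two, which is why only $D_2$ survives the comparison.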

Example 7. Consider the instance below and the MD
$R[A] \approx S[C] \rightarrow R[B] \doteq S[D]$.

R   A     B          S   C     D
    a_1   b_1            c_1   d_1

Assuming that $a_1 \approx c_1$, this instance has two minimal resolved
instances, namely

R   A     B          S   C     D
    a_1   d_1            c_1   d_1

R   A     B          S   C     D
    a_1   b_1            c_1   b_1
□



Considering that MDs concentrate on changes of attribute 
values, we consider that this notion of minimality is appro- 
priate. The comparisons have to be made at the attribute 
value level. Notice that in CQA a few other notions of minimality
and comparison of instances have been investigated
[7].

3.3 Resolved answers 

Let $Q(\bar{x})$ be a query expressed in the first-order language
$L(\mathcal{S})$ associated to schema $\mathcal{S}$. Now we are in position to
characterize the admissible answers to Q from D, as those 
that are invariant under the matching resolution process. 

Definition 7. A tuple of constants $\bar{a}$ is a resolved answer
to $Q(\bar{x})$ wrt the set M of MDs, denoted $D \models_M Q[\bar{a}]$, iff
$D' \models Q[\bar{a}]$, for every $D' \in Res(D, M)$. We denote with
ResAn(D, Q, M) the set of resolved answers to Q from D
wrt M. □

Example 8. (Example 7 continued) The set of resolved answers
to the query $Q_1(x, y) : R(x, y)$ is empty since there are
no tuples that are in the instance of R in all minimal resolved
instances. On the other hand, the set of resolved answers to
the query $Q_2(x) : \exists y (R(x, y) \wedge (y = b_1 \vee y = d_1))$ is $\{a_1\}$. □
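Definition 7 amounts to intersecting the answer sets obtained from the individual MRIs. The following sketch replays Example 8 under the assumption that the two MRIs of Example 7 are already known; queries are modelled as plain Python functions over an instance, and all names (`resolved_answers`, `mri_1`, `mri_2`, `q1`, `q2`) are illustrative only.

```python
def resolved_answers(mris, query):
    """Answers that `query` returns in every given MRI (Definition 7)."""
    answer_sets = [set(query(inst)) for inst in mris]
    return set.intersection(*answer_sets) if answer_sets else set()


# The two minimal resolved instances of Example 7.
mri_1 = {"R": {("a1", "d1")}, "S": {("c1", "d1")}}
mri_2 = {"R": {("a1", "b1")}, "S": {("c1", "b1")}}


def q1(inst):
    """Q1(x, y): R(x, y)."""
    return inst["R"]


def q2(inst):
    """Q2(x): exists y with R(x, y) and (y = b1 or y = d1)."""
    return {(x,) for (x, y) in inst["R"] if y in ("b1", "d1")}
```

Evaluating `resolved_answers([mri_1, mri_2], q1)` gives the empty set, and the same call with `q2` gives `{("a1",)}`, in line with Example 8. Enumerating MRIs explicitly is exponential in general, which is what the complexity analysis and the query rewriting approach of the paper are meant to avoid.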

In Section 4 we will study the complexity of the problem 
of computing the resolved answers, which we now formally 
introduce. 

Definition 8. Given a schema $\mathcal{S}$, a query $Q(\bar{x}) \in L(\mathcal{S})$,
and a set M of MDs, the Resolved Answer Problem (RAP)
is the problem of deciding membership of the set

$RA_{Q,M}$ := {(D, $\bar{a}$) | $\bar{a}$ is a resolved answer to Q from
instance D wrt M}.

If Q is a boolean query, it is the problem of determining 
whether Q is true in all minimal resolved instances of D. □ 

4. COMPUTING RESOLVED INSTANCES 
AND ANSWERS 

In this section, we consider the complexity of the $RA_{Q,M}$
problem introduced in the previous section. For this goal it 
is useful to associate a graph to the set of MDs. We need a 
few notions before introducing it. 

Definition 9. A set M of MDs is in standard form if no 
two MDs in M have the same expression to the left of the 
arrow. □ 

Figure 1: An MD-Graph

Notice that any set of MDs can be put in standard form
by replacing subsets of MDs of the form
$\{R[\bar{A}] \approx S[\bar{B}] \rightarrow R[\bar{C}_1] \doteq S[\bar{E}_1], \ldots, R[\bar{A}] \approx S[\bar{B}] \rightarrow R[\bar{C}_n] \doteq S[\bar{E}_n]\}$
by the single MD $R[\bar{A}] \approx S[\bar{B}] \rightarrow R[\bar{C}] \doteq S[\bar{E}]$, where the set of
corresponding pairs of attributes of $(\bar{C}, \bar{E})$ is the union of
those of $(\bar{C}_1, \bar{E}_1), \ldots, (\bar{C}_n, \bar{E}_n)$. From now on, we will assume
that all sets of MDs are in standard form.
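The conversion to standard form is a grouping step. As a rough sketch, assume an MD is encoded as a pair `(lhs, rhs)`, where each component is a tuple of corresponding attribute pairs such as `("R.A", "S.B")`; this encoding and the name `standard_form` are assumptions made here for illustration.

```python
def standard_form(mds):
    """Merge MDs with identical left-hand sides, taking the union of RHS pairs."""
    merged = {}
    for lhs, rhs in mds:
        pairs = merged.setdefault(lhs, [])
        for pair in rhs:
            if pair not in pairs:  # union, keeping first-seen order
                pairs.append(pair)
    return [(lhs, tuple(pairs)) for lhs, pairs in merged.items()]


# Two MDs share the LHS R[A] ~ S[B]; the third has a different LHS.
mds = [
    ((("R.A", "S.B"),), (("R.C", "S.E"),)),
    ((("R.A", "S.B"),), (("R.F", "S.G"),)),
    ((("R.C", "S.E"),), (("R.A", "S.B"),)),
]
```

Applied to `mds`, the first two dependencies collapse into a single one whose right-hand side contains both corresponding pairs, and the third is left unchanged.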

For an MD m, LHS(m) and RHS(m) denote the sets of 
attributes that appear to the left side and to right side of 
the arrow, respectively. 

Definition 10. Let M be a set of MDs in standard form.
The MD-graph of M, denoted MDG(M), is a directed graph
with a vertex labeled m for each $m \in M$, and with an edge
from $m_1$ to $m_2$ iff $RHS(m_1) \cap LHS(m_2) \neq \emptyset$. □

Example 9. Consider the set of MDs:
$m_1 : R[A] \approx S[B] \rightarrow R[C] \doteq S[D]$,
$m_2 : R[C] \approx S[D] \rightarrow R[A] \doteq S[B]$,
$m_3 : S[E] \approx S[B] \rightarrow T[F] \doteq T[F]$.
It has the MD-graph shown in Figure 1. □
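The edge condition of Definition 10 is a set intersection test, so the MD-graph can be derived directly from the attribute sets. In the sketch below each MD is represented only by its LHS and RHS attribute sets, with attributes written as qualified strings such as `"R.A"`; this representation and the name `md_graph` are assumptions for illustration.

```python
def md_graph(mds):
    """Edges (m1, m2) such that RHS(m1) and LHS(m2) share an attribute."""
    return {(n1, n2)
            for n1, (_, rhs1) in mds.items()
            for n2, (lhs2, _) in mds.items()
            if rhs1 & lhs2}


# The MDs of Example 9 as (LHS attributes, RHS attributes).
example_9 = {
    "m1": ({"R.A", "S.B"}, {"R.C", "S.D"}),
    "m2": ({"R.C", "S.D"}, {"R.A", "S.B"}),
    "m3": ({"S.E", "S.B"}, {"T.F"}),
}
```

For Example 9 this yields the edges from m1 to m2, from m2 to m1, and from m2 to m3 (through the shared attribute S[B]). Since the graph contains edges, the set is interacting in the sense defined next.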

A set of MDs whose MD-graph contains edges is called in- 
teracting. Otherwise, it is non-interacting. 

Definition 11. (a) A cycle C in an MD-graph MDG(M)
is called a simple cycle if for each pair $(m_1, m_2)$ of successive
vertices in C, the corresponding pairs to the left of the arrow
in $m_2$ are corresponding pairs to the right of the arrow in
$m_1$, and do not occur elsewhere in $m_1$.

(b) A set M of MDs is simple-cycle if its MD-graph MDG(M)
is a simple cycle. □

Example 10. The following is a simple-cycle set of MDs.

$m_1 : R[A] \approx S[B] \rightarrow R[C, F] \doteq S[E, G]$,

$m_2 : R[C] \approx S[E] \wedge R[F] \approx S[G] \rightarrow R[A] \doteq S[B]$.

The MD-graph is a cycle, because attributes in $RHS(m_2)$ are
in $LHS(m_1)$, and vice versa. This cycle is a simple cycle,
because the corresponding pairs (C, E) and (F, G) to the
right of the arrow in $m_1$ are corresponding pairs to the left
of the arrow in $m_2$, and vice versa. □

For this class of MDs it is easy to characterize the form an 
MRI takes. This is first illustrated with an example. 

Example 11. Consider the instance D (with tuple ids) and
simple-cycle set of MDs.

$R[A] \approx R[A] \rightarrow R[B] \doteq R[B]$,
$R[B] \approx R[B] \rightarrow R[A] \doteq R[A]$.

The only similarities are: $a_i \approx a_j$, $b_i \approx b_j$, $d_i \approx d_j$, $e_i \approx e_j$,
with $i, j \in \{1, 2\}$. If the MDs are applied twice, successively,
to the instance, one possible result is:

From this it is clear that, in any sequence of states $D, D_1, D_2, \ldots$
obtained by applying the MDs, the updated instances
must have the following pairs of values equal:



$D_i$, i odd
tuple (id) pairs    (1,4), (2,3)    (1,2), (3,4)

$D_i$, i even
Column
tuple (id) pairs    (1,2), (3,4)    (1,4), (2,3)



In any stable instance, the pairs of values in the above tables 
must be equal. Clearly, this can only be the case if all values 
in the A and B columns are equal. This can be achieved with 
a single update, choosing any value as the common value. 
Thus, the MRIs of any instance are those with all values in 
the A and B columns set to their most common value. In 
the case of D above, there are 16 MRIs. □ 

The Algorithm ComputeMRI below generalizes the idea pre- 
sented in Example 11. It computes the set of all MRIs for 
the case of an arbitrary simple cycle. (The relevant defini- 
tions are given below). 

Definition 12. Let m be the MD $R[\bar{A}] \approx S[\bar{B}] \rightarrow R[\bar{C}] \doteq S[\bar{E}]$.
The transitive closure, $T_{\approx}$, of $\approx$ is the transitive closure
of the binary relation on tuples $t_1[\bar{A}] \approx t_2[\bar{B}]$,
where $t_1 \in R$ and $t_2 \in S$. □

Notice that Definition 12 implies that the transitive closure
of $\approx$ is an equivalence relation on the tuples of R and S.
It therefore forms a partition of these tuples into disjoint 
equivalence classes. 

Definition 13. For a set S of binary relations, the transitive
closure, $T_S$, of S is the transitive closure of the union
of all relations in S. □

This definition can be applied, in particular, to the $T_{\approx}$'s in
Definition 12, for several MDs. For the case in which S in
Definition 13 is a set of equivalence relations, $T_S$ is also an
equivalence relation. These definitions are used in Algorithm
ComputeMRI in Table 1.

Proposition 1. Algorithm ComputeMRI returns the set of 
all MRIs of D wrt a simple-cycle set M of MDs. □ 

With some minor modifications to the T relation in Algorithm
ComputeMRI, we can make the latter work also for
sets of MDs whose vertices in the MD-graph can occur on
more than one simple cycle, as shown in Figure 2 (see footnote 1). The
following HSC class of sets of MDs extends the simple-cycle
class.



Table 1: Algorithm ComputeMRI

Input: A database instance D and a simple-cycle set M
of MDs.

Output: Set of MRIs of D with respect to M.

1)  For $1 \leq j \leq n$:
2)      Compute $T_j$, the transitive closure of $\approx_j$
3)  Compute the transitive closure T of the set $\{T_j \mid 1 \leq j \leq n\}$
4)  For each corresponding pair of attributes (A, B)
    that appears in M:
5)      For each equivalence class E defined by T:
6)          Choose a value v from among the A and B
            attribute values of tuples in $R \cap E$ and
            $S \cap E$, respectively, such that no other
            value occurs more frequently
7)          For each tuple $t \in E$:
8)              $t[A] \leftarrow v$ if $t \in R$
9)              $t[B] \leftarrow v$ if $t \in S$
10) Repeat 4-9 for other choices of v to produce other
    MRIs
11) Return the resulting set of MRIs
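The steps of Table 1 can be sketched in executable form for a restricted setting. The code below is not the paper's algorithm in full generality: it assumes a single relation R and a simple-cycle set of MDs over the attributes listed in `attrs`, each of the form R[X] ≈ R[X] → R[Y] ≐ R[Y], so that every listed attribute occurs on some left-hand side and every corresponding pair has the form (X, X). Equivalence classes are obtained with a union-find structure, and the name `compute_mri` is hypothetical.

```python
from collections import Counter
from itertools import product


def compute_mri(inst, attrs, sim):
    """Return the instances produced by steps 1-11 of Table 1 (restricted case)."""
    ids = list(inst)
    parent = {t: t for t in ids}

    def find(t):
        while parent[t] != t:
            parent[t] = parent[parent[t]]  # path halving
            t = parent[t]
        return t

    # Steps 1-3: merge tuples that are similar on some LHS attribute; the
    # resulting classes are those of the combined transitive closure T.
    for a in attrs:
        for t, s in product(ids, ids):
            if sim(inst[t][a], inst[s][a]):
                parent[find(t)] = find(s)
    classes = {}
    for t in ids:
        classes.setdefault(find(t), []).append(t)

    # Steps 4-6: per attribute and class, collect the most frequent values.
    slots, options = [], []
    for a in attrs:
        for members in classes.values():
            counts = Counter(inst[t][a] for t in members)
            top = max(counts.values())
            slots.append((a, members))
            options.append(sorted(v for v, c in counts.items() if c == top))

    # Steps 7-11: apply every combination of choices.
    results = []
    for choice in product(*options):
        new = {t: dict(row) for t, row in inst.items()}
        for (a, members), v in zip(slots, choice):
            for t in members:
                new[t][a] = v
        results.append(new)
    return results
```

For instance, with equality as similarity and the made-up rows (a, b1), (a, b2), (c, b2), the three tuples fall into one class and the single result sets every row to (a, b2). When two values tie for most frequent, one result is produced per choice.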




1 The modification involves using the tuple-attribute closure 
introduced in Definition 17, for the cases where the MD- 
graph has more than one connected component. 



Figure 2: The MD-graph of an HSC set of MDs 



Definition 14. A set M of MDs is hit simple cyclic (HSC)
iff each vertex in MDG(M) is on at least one simple cycle
of MDG(M). □

The next example shows that even for simple classes of MDs, 
there may be exponentially many MRIs. 

Example 12. Consider the relational predicate R[A, B] and
the MD $m : R[A] \approx R[A] \rightarrow R[B] \doteq R[B]$. Let D be an
instance of R with tuples $\{t_i \mid 1 \leq i \leq n\}$, for some even
number n, such that: (a) The values in D satisfy the similarities
$t_i[A] \approx t_{i+1}[A]$ for all odd i with $1 \leq i \leq n - 1$,
and no others, and (b) $t_i[B] \neq t_{i+1}[B]$ for all odd i with
$1 \leq i \leq n - 1$. It is clear that an MRI is obtained by setting
the B attributes of $t_i$ and $t_{i+1}$ to either $t_i[B]$ or $t_{i+1}[B]$ for
each odd i such that $1 \leq i \leq n - 1$. The number of MRIs is
the number of possible choices of such values, which is $2^{n/2}$.
□
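The count in Example 12 can be confirmed by brute force for small n. The sketch below makes several assumptions beyond the text: the B values are taken to be pairwise distinct, candidate instances only use values from the active domain, and, because updates to B never affect the similarities on A, a candidate is treated as resolved exactly when the B values of each similar pair coincide. The name `count_mris` is hypothetical.

```python
from itertools import product


def count_mris(n):
    """Return (minimum number of changes, number of instances achieving it)."""
    b = [f"b{i}" for i in range(1, n + 1)]        # original, distinct B values
    pairs = [(i, i + 1) for i in range(0, n, 2)]  # 0-based indices of similar tuples
    best, count = None, 0
    for cand in product(b, repeat=n):
        if any(cand[i] != cand[j] for i, j in pairs):
            continue  # a similar pair still disagrees on B: not stable
        changes = sum(x != y for x, y in zip(b, cand))
        if best is None or changes < best:
            best, count = changes, 1
        elif changes == best:
            count += 1
    return best, count
```

For n = 4 this returns a minimum of 2 changes achieved by 4 instances, and for n = 6 a minimum of 3 changes achieved by 8 instances, matching $2^{n/2}$.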

The MD in the previous example is HSC. Actually, the simple
form of the MRIs for HSC sets can be used to obtain
an upper bound for $RA_{Q,M}$ that (under usual complexity-theoretic
assumptions) is lower than exponential. This relies
on the assumption that, if a resolved instance contains values
outside the active domain of the original instance, then
those values are bounded above in length by a polynomial
in the size of the original instance. This assumption is in
accord with practical constraints on databases and any reasonable
definition of similarity.

Theorem 3. For HSC sets of MDs, if resolved instances 
are restricted to contain values bounded in length by a polynomial
in the length of the input, then problem $RA_{Q,M}$ is
in coNP for any first-order query Q. □

In this section, we established a complexity bound for $RA_{Q,M}$
which holds for a class of MDs with cyclic MD-graphs and
all first-order queries. The bound follows from the simple 
form of the MRIs, as described by Algorithm ComputeMRI. 
In the next section, we further exploit this latter result to 
show that, for HSC sets and certain first-order queries, the 
resolved answers can be retrieved in polynomial time. 

5. RESOLVED QUERY ANSWERING: 
TRACTABILITY AND REWRITING 

In this section, we discuss tractable cases of $RA_{Q,M}$. In
particular, we propose a query rewriting technique for obtaining
the resolved answers for certain FO queries and MDs.
In Section 6, we will relate $RA_{Q,M}$ to consistent query answering
(CQA) [7]. This connection and some known results
in CQA will allow us to identify further tractable cases,
but also to establish the intractability of $RA_{Q,M}$ for certain
classes of queries and MDs. The latter makes the tractability
results obtained in this section even more relevant.

A possible approach to obtaining the resolved answers to a
query Q from an instance D is to rewrite Q into a new query
Q' on the basis of Q and M. Q' should be such that, when
posed to D (as usual), it returns the resolved answers to Q
from D. In this case, it is not necessary to explicitly compute
the MRIs. If Q' can be efficiently evaluated against D, then
the resolved answers can also be efficiently computed and
$RA_{Q,M}$ becomes tractable. This methodology was proposed
in [3] for CQA.

This section investigates this query rewriting approach to 
the computation of resolved answers for HSC sets of MDs. 
The input queries Q will be conjunctive queries with certain 
restrictions on the joins. However, the rewritten queries Q' 
may involve aggregate operators (actually, Count), universal 
quantification, and Datalog rules (to specify the transitive 
closure). We will need to compute transitive closures and 
count the number of occurrences of values in order to enforce 
minimal change. In any case, the resulting query Q' will still 
be evaluable in polynomial time in the size of D. 

Specifically, the input queries we consider have the form
$Q(\bar{x}) : \exists \bar{u}(R_1(\bar{v}_1) \wedge \cdots \wedge R_n(\bar{v}_n))$, where $\bar{x} = (\bigcup_i \bar{v}_i) \setminus \bar{u}$. For
tractability of $RA_{Q,M}$, we need additional restrictions on
them.

Definition 15. (a) For a set M of MDs defined on schema
S, the changeable attributes of S are those that appear to
the right of the arrow in some $m \in M$. The other attributes
of S are called unchangeable.

(b) Let Q be a conjunctive query and M a set of MDs.
Query Q is an unchangeable attribute join conjunctive query
(ucajCQ) if there are no bound, repeated variables in Q that
correspond to changeable attributes. □

Example 13. Let M be the single MD $R[A] \approx R[A] \rightarrow R[B] \rightleftharpoons R[B]$.
The query $Q(x, z) : \exists y(R(x, y) \wedge R(z, y))$
is not in the ucajCQ class, because it contains a bound,
repeated variable (y) which corresponds to a changeable
attribute (B). However, the query $Q(y) : \exists x \exists z(R(x, y) \wedge R(x, z))$
is in ucajCQ, since the only bound, repeated variable
(x) corresponds to an unchangeable attribute (A). □
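
The membership test of Definition 15(b) is purely syntactic, as the following sketch suggests. It is an illustration under assumptions made here, not the paper's code: queries are encoded as lists of `(relation, [variables])` atoms, changeable attributes as `(relation, position)` pairs, and the function name `is_ucaj_cq` is hypothetical.

```python
from collections import Counter


def is_ucaj_cq(atoms, free_vars, changeable):
    """Check Definition 15(b) on a conjunctive query (hypothetical encoding).

    atoms:      list of (relation_name, [variable, ...])
    free_vars:  set of free variables of the query
    changeable: set of (relation_name, position) for changeable attributes
    """
    # Number of occurrences of each variable across all atoms.
    occurrences = Counter(v for _, args in atoms for v in args)
    for rel, args in atoms:
        for pos, var in enumerate(args):
            bound = var not in free_vars
            repeated = occurrences[var] > 1
            if bound and repeated and (rel, pos) in changeable:
                return False
    return True


changeable = {("R", 1)}  # R[B] is changeable under the MD of Example 13
q1 = [("R", ["x", "y"]), ("R", ["z", "y"])]  # Q(x, z)
q2 = [("R", ["x", "y"]), ("R", ["x", "z"])]  # Q(y)
print(is_ucaj_cq(q1, {"x", "z"}, changeable))
print(is_ucaj_cq(q2, {"y"}, changeable))
```

The two calls reproduce the classification of the two queries in Example 13.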

In Section 6 we will encounter HSC MDs (even non-interacting
MDs) and conjunctive queries outside ucajCQ for which
$RA_{Q,M}$ is intractable (cf. Theorem 5 below).

To incorporate counting into FO queries, we will use the
operator Count(R) that returns the number of tuples in relation
R (cf. [1]). Count will be applied to sets of tuples of
the form $\{\bar{t} \mid C\}$, where $\bar{t}$ is a tuple of variables, and C is
an FO condition whose free variables include those in $\bar{t}$. Now
we show a simple example of rewriting that uses Count.

Example 14. Consider a relation R, the MD $R[A] \approx R[A] \rightarrow R[B] \rightleftharpoons R[B]$,
and the query Q(x, y, z) : R(x, y, z). R and
its (single) MRI are shown below.

The set of resolved answers to Q is $\{(a_1, b_2, c_1), (a_1, b_2, c_2), (a_1, b_2, c_3)\}$.
It is not difficult to see that the following query
returns the resolved answers (for any initial instance of R).
In it, T stands for the transitive closure $T^{\approx}$ of $\approx$ (cf. Definition 12).

$$Q'(x,y,z) : \exists y'\, R(x,y',z) \wedge \forall y'' \big[\mathit{Count}\{(x',y,z') \mid T((x,y',z),(x',y,z')) \wedge R(x',y,z')\}
> \mathit{Count}\{(x',y'',z') \mid T((x,y',z),(x',y'',z')) \wedge R(x',y'',z') \wedge y'' \ne y\}\big].$$

Intuitively, the first conjunct requires the existence of a tu- 
ple t with the same A and C attribute values as the answer. 
Since the values of these attributes are not changed when go- 
ing from the original instance to an MRI, such a tuple must 
exist. However, the tuple is not required to have the same B 
attribute value as the answer tuple, because this attribute 
can be modified. For example, $(a_1, b_2, c_1)$ is a resolved answer,
but is not in R. What makes it a resolved answer is
the fact that it is in an equivalence class of T (consisting of
all three tuples in R) for which $b_2$ occurs more frequently as
a B attribute value than any other value. This condition on 
resolved answers is expressed by the second conjunct. □ 
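
The effect of the two conjuncts can be imitated procedurally. The sketch below is a simplification under stated assumptions, not the rewritten query itself: the similarity relation on A is taken to be plain equality (so the classes of T are the groups of tuples sharing an A-value), and the three-tuple instance is a hypothetical one chosen only to be consistent with the resolved answers quoted in the example. The name `resolved_answers` is introduced here for illustration.

```python
from collections import Counter, defaultdict


def resolved_answers(tuples):
    """Resolved answers to Q(x, y, z) : R(x, y, z) under
    R[A] ~ R[A] -> R[B] <=> R[B], assuming ~ is equality on A."""
    classes = defaultdict(list)
    for t in tuples:
        classes[t[0]].append(t)  # equality is already transitive

    answers = set()
    for group in classes.values():
        ranked = Counter(b for _, b, _ in group).most_common()
        # A B-value qualifies only if it is strictly more frequent
        # than every other B-value in the class (second conjunct).
        if len(ranked) == 1 or ranked[0][1] > ranked[1][1]:
            best = ranked[0][0]
            # A and C keep their original values (first conjunct).
            answers.update((a, best, c) for a, _, c in group)
    return answers


# Hypothetical instance, assumed for illustration only.
R = [("a1", "b1", "c1"), ("a1", "b2", "c2"), ("a1", "b2", "c3")]
print(sorted(resolved_answers(R)))
```

When two B-values tie for the highest frequency in a class, the class contributes no resolved answers, since the MRIs then disagree on the B-value.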

For simplicity, we present our query rewriting algorithm for 
non-interacting MDs, a special case of HSC sets of MDs 
where the connected components have only one vertex. The 
generalization to arbitrary HSC sets is straightforward, and 
the required modifications are indicated at the end of this 
section. First we require the following definitions. 

Definition 16. Let M be a set of MDs on schema S. (a)
Define a (symmetric) binary relation $\rightleftharpoons_r$ which relates attributes
R[A], S[B] of S if there is an MD in M where
$R[A] \rightleftharpoons S[B]$ appears to the right of the arrow.

(b) The attribute closure, $T_{at}$, of M is the binary relation on
attributes defined as the reflexive, transitive closure of $\rightleftharpoons_r$.

(c) We use the notation $E_{R[A]}$ to denote the equivalence class
of $T_{at}$ to which attribute R[A] belongs. □

Note that, in general, there will be pairs of attributes R[A],
S[B] for which $E_{R[A]} = E_{S[B]}$.



Table 2: Algorithm Rewrite

Input: A query in ucajCQ and a non-interacting set of MDs M.
Output: The rewritten query Q'.

1)  Let $Q(\bar{t}) : \exists \bar{u} \bigwedge_{1 \le i \le n} R_i(\bar{v}_i)$ be the query.
2)  For each $R_i(\bar{v}_i)$
3)      Let C be the set of changeable attributes of $R_i$ corresponding to a free variable in $\bar{v}_i$
4)      If C is empty
5)          $Q_i(\bar{v}_i) \leftarrow R_i(\bar{v}_i)$
6)      Else
7)          Let $\bar{v}'_i$ be $\bar{v}_i$ with each variable $v_{iA}$ in $\bar{v}_i$ holding the value of an attribute $A \in C$ replaced by a new variable $v'_{iA}$
8)          Let $\bar{v}_{iC}$ be the vector of variables $v_{iA}$, ...
            ...
            Generate atom $R_j(\bar{u}'_{jk})$, where all variables in $\bar{u}'_{jk}$ are labelled as in $\bar{u}_{jk}$ except the one holding the value of $R_j[B_k]$, which is $v''_{iA}$
14)         $C^1_{jk} \leftarrow \mathit{Count}\{\bar{u}_{jk} \mid TS(\bar{v}'_i, R_i[A], \bar{u}_{jk}, R_j[B_k]) \wedge R_j(\bar{u}_{jk})\}$
            ...
            ... $> \sum_{j,k} C^2_{jk}$ ]}
17) $Q'(\bar{t}) \leftarrow \exists \bar{u} \bigwedge_{1 \le i \le n} Q_i(\bar{v}_i)$
18) return Q'



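Example 15. Let M be the set of MDs

$R[A] \approx_1 \ldots \rightarrow R[C] \rightleftharpoons S[D]$,
$\ldots \approx_2 T[F] \wedge S[G] \approx T[H] \rightarrow S[D, K] \rightleftharpoons T[J, L]$,
$T[F] \approx_3 T[H] \rightarrow T[L, N] \rightleftharpoons T[M, P]$.

The equivalence classes of $T_{at}$ are $E_{R[C]} = \{R[C], S[D], T[J]\}$,
$E_{S[K]} = \{S[K], T[L], T[M]\}$, and $E_{T[N]} = \{T[N], T[P]\}$. □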
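
The equivalence classes of the attribute closure depend only on the attribute pairs matched to the right of the arrows, so they can be computed with a standard union-find pass. The following sketch is illustrative: the function name `attribute_classes` and the string encoding of attributes are assumptions made here. The input lists the matched pairs read off the right-hand sides in Example 15 (S[D, K] matched with T[J, L] contributes the pairs S[D], T[J] and S[K], T[L], and similarly for the third MD).

```python
def attribute_classes(matched_pairs):
    """Equivalence classes of the reflexive, transitive closure of the
    symmetric relation given by matched_pairs (union-find sketch)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in matched_pairs:
        parent[find(a)] = find(b)

    classes = {}
    for attr in list(parent):
        classes.setdefault(find(attr), set()).add(attr)
    return {frozenset(c) for c in classes.values()}


pairs = [
    ("R[C]", "S[D]"),
    ("S[D]", "T[J]"), ("S[K]", "T[L]"),
    ("T[L]", "T[M]"), ("T[N]", "T[P]"),
]
for cls in sorted(sorted(c) for c in attribute_classes(pairs)):
    print(cls)
```

The three classes printed agree with those listed in the example.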

To describe the MRIs in this case, we need the transitive 
closure relation defined below. 

Definition 17. Let m be the MD $R[\bar{A}] \approx S[\bar{B}] \rightarrow R[\bar{C}] \rightleftharpoons S[\bar{E}]$.

(a) Let $\approx'$ be the following binary relation on tuple-attribute
pairs: $(t_1, C) \approx' (t_2, E)$ iff $t_1[\bar{A}] \approx t_2[\bar{B}]$ and (C, E) is a
corresponding pair of $(\bar{C}, \bar{E})$.

(b) The tuple-attribute closure TA of m is the reflexive, transitive
closure of $\approx'$. □

We denote by TS the transitive closure of a set of tuple-attribute
closures (cf. Definition 13). TS partitions the set
of tuple/attribute pairs into disjoint equivalence classes.

To keep the notation simple, we omit parentheses delimiting
tuples and tuple/attribute pairs when writing the arguments
of TA and TS. For example, for tuples $t_1 = (a, b, c)$
and $t_2 = (d, e, f)$ with attributes A and C, respectively,
TS(((a, b, c), A), ((d, e, f), C)) is written as TS(a, b, c, A, d, e,
f, C).

Algorithm Rewrite in Table 2 outputs a rewritten query
Q' that returns the resolved answers to a given input conjunctive
query Q and set of non-interacting MDs. This is
done by separately rewriting each conjunct $R_i(\bar{v}_i)$ in Q. If
$R_i(\bar{v}_i)$ contains no free variables holding the values of
changeable attributes, then it is unchanged (line
5). Otherwise, it is replaced with a conjunction involving the
same atom and additional conjuncts which use the Count
operator. The conjuncts involving Count express the condition
that, for each changeable attribute value returned by
the query, this value is more numerous than any other value
in the same set of values that is equated by the MDs. The
Count expressions contain new local variables as well as a
new universally quantified variable $v''_{iA}$.

Example 16. We illustrate the algorithm with predicates
R[ABC], S[EFG], U[HI], the query $Q(x, y, z) : \exists t, u, p, q\,(R(x, y, z) \wedge S(t, u, z) \wedge U(p, q))$,
and the MDs $R[A] \approx S[E] \rightarrow R[B] \rightleftharpoons S[F]$ and
$S[E] \approx U[H] \rightarrow S[F] \rightleftharpoons U[I]$.

Since the S and U atoms have no free variables holding
the values of changeable attributes, these conjuncts remain
unchanged (line 5). The only free variable holding the value
of a changeable attribute is y. Therefore, line 7 sets $\bar{v}'_1$ to
(x, y', z). Variable y contains the value of attribute R[B].
The equivalence class $E_{R[B]}$ of $T_{at}$ to which R[B] belongs is
{R[B], S[F], U[I]}, so the loop at line 11 generates the atoms
R(x', y, z'), R(x', y'', z'), S(t', y, z'), S(t', y'', z'), U(p', y),
U(p', y''). The rewritten query is obtained by replacing in
Q the conjunct R(x, y, z) by

$$\exists y'\big(R(x, y', z) \wedge \forall y''\big[
\mathit{Count}\{(x', y, z') \mid TS(x, y', z, R[B], x', y, z', R[B]) \wedge R(x', y, z')\}
+ \mathit{Count}\{(t', y, z') \mid TS(x, y', z, R[B], t', y, z', S[F]) \wedge S(t', y, z')\}
+ \mathit{Count}\{(p', y) \mid TS(x, y', z, R[B], p', y, U[I]) \wedge U(p', y)\}
>
\mathit{Count}\{(x', y'', z') \mid TS(x, y', z, R[B], x', y'', z', R[B]) \wedge R(x', y'', z') \wedge y'' \ne y\}
+ \mathit{Count}\{(t', y'', z') \mid TS(x, y', z, R[B], t', y'', z', S[F]) \wedge S(t', y'', z') \wedge y'' \ne y\}
+ \mathit{Count}\{(p', y'') \mid TS(x, y', z, R[B], p', y'', U[I]) \wedge U(p', y'') \wedge y'' \ne y\}\big]\big).$$ □

Theorem 4. For a set M of non-interacting MDs and a
query Q in the class ucajCQ, the query Q' computed by
Algorithm Rewrite returns the resolved answers to Q when
posed to any instance. □

As expected, the rewriting algorithm that produced the rewrit- 
ten query does not depend upon the dirty instance at hand, 
but only on the MDs and the input query, and runs in poly- 
nomial time. 

Algorithm Rewrite can be easily adapted and extended
to handle HSC sets of MDs. All that is required is a modification
to the tuple-attribute closure in Definition 17, as
follows: For an HSC set of MDs M and $m \in M$, a pair of
tuples $t_1$ and $t_2$ satisfies $(t_1, C) \approx' (t_2, E)$ iff $t_1[\bar{A}] \approx t_2[\bar{B}]$
and (C, E) appears as a corresponding pair to the right of
the arrow in some MD in the same connected component of
the MD-graph as m. Tuple-attribute closure is redefined as
the transitive closure of this new relation. As with Theorem 4,
correctness follows from the simple form of
the MRIs, and is proved using the same technique as in the
proof of Proposition 1.

6. THE CQA CONNECTION 

MDs can be seen as a new form of integrity constraint
(IC). An instance D violates an MD m if there are unresolved
duplicates, i.e. tuples $t_1$ and $t_2$ in D that satisfy
the similarity condition of m, but differ on some pair of attributes
that are matched by m. The instances that are
consistent with a set of MDs M are resolved instances of
themselves with respect to M. Among classical ICs, the
closest analogues of MDs are functional dependencies (FDs).

Given a database instance D and a set of ICs $\Sigma$, possibly
not satisfied by D, consistent query answering (CQA)
is the problem of characterizing and computing the answers
to queries Q that are true in all the instances D' that are
consistent with $\Sigma$ and minimally differ from D [3]. The consistent
instances D' are called repairs. Minimal difference
can be defined in different ways. Most of the research in
CQA has concentrated on the case where the symmetric difference
of instances, as sets of tuples, is made minimal under
set inclusion [3, 7, 11]. The minimization of
the cardinality of this difference has also been investigated [20,
2]. Other forms of minimization measure the differences in
attribute values between D and D' [17, 21, 16, 8]. Because
of their practical importance, much work on CQA has been
done for the case where $\Sigma$ is a set of functional dependencies
(FDs), in particular, key constraints (KCs) [12, 18, 23, 22,
24].

Actually, for a set of KCs $\mathcal{K}$, and repairs based on tuple
deletions, a repair D' of an instance D can be characterized
as a maximal subset of D that satisfies $\mathcal{K}$: $D' \subseteq D$, $D' \models \mathcal{K}$,
and there is no $D''$ with $D' \subsetneq D'' \subseteq D$ and $D'' \models \mathcal{K}$ [12].
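
For a single key constraint on one relation, the maximal consistent subsets are exactly the instances that keep one tuple from each group of tuples sharing a key value. The sketch below illustrates this characterization and the resulting consistent answers. It is a brute-force illustration, exponential in general, and the names `repairs` and `consistent_answers`, as well as the convention that the key occupies the first `key_len` positions, are assumptions made here.

```python
from collections import defaultdict
from itertools import product


def repairs(instance, key_len):
    """All repairs of `instance` (a set of tuples) w.r.t. the key formed
    by the first `key_len` attributes, under tuple deletions."""
    groups = defaultdict(list)
    for t in sorted(set(instance)):
        groups[t[:key_len]].append(t)
    # Keep exactly one tuple per key value: maximal and consistent.
    return [frozenset(choice) for choice in product(*groups.values())]


def consistent_answers(instance, key_len, query):
    """Answers returned by `query` in every repair."""
    return set.intersection(*(set(query(r)) for r in repairs(instance, key_len)))


D = {(1, "a"), (1, "b"), (2, "c")}
print(len(repairs(D, 1)))
print(consistent_answers(D, 1, lambda r: r))
```

Here the two tuples with key 1 conflict, so there are two repairs, and only the tuple with key 2 survives in both.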

Now, for an FO query $Q(\bar{x})$ and a set of KCs $\mathcal{K}$, the consistent
query answering problem is about deciding membership
of the set

$CQA_{Q,\mathcal{K}} = \{(D, \bar{a}) \mid \bar{a}$ is an answer to Q in all repairs of
D with respect to $\mathcal{K}\}$.

A tuple $\bar{a}$ satisfying the above is called a consistent answer to Q
from D.

Notice that this notion of minimality involved in repairs 
wrt FDs is tuple and set-inclusion oriented, whereas the one 
related to MRIs (cf. Definition 6) is attribute and cardinality 
oriented. However, the connection can still be established. 
In particular, the following result can be obtained from [12, 
Thm. 3.3]. 

Theorem 5. Consider a relational predicate R[A, B, C],
the MD

$m : R[A] = R[A] \rightarrow R[B, C] \rightleftharpoons R[B, C]$, (5)

and the query $Q : \exists x \exists y \exists y' \exists z (R(x, y, c) \wedge R(z, y', d) \wedge y = y')$.
$RA_{Q,\{m\}}$ is coNP-complete. □

Notice that the conjunctive query in this result does not
belong to the ucajCQ class.



For certain classes of conjunctive queries and ICs consisting
of a single KC per relation, CQA has been proved to
be tractable. This is the case for the $C_{forest}$ class of conjunctive
queries [18]. Actually, for this class there is an FO
rewriting of the original query that returns the certain answers.
$C_{forest}$ excludes repeated relations and allows joins
only between non-key and key attributes. Similar results
were subsequently proved for a larger class of queries that
includes some queries with repeated relations and joins between
non-key attributes [23, 22, 24]. The following result
allows us to take advantage of tractability results for CQA
in our MD setting.

Proposition 2. Let D be a database instance with a single
relation R. Let m be an MD of the form $R[A] = R[A] \rightarrow R[B] \rightleftharpoons R[B]$,
where the set of attributes of R is $A \cup B$
and $A \cap B = \emptyset$. Then there is a polynomial time reduction
from $RA_{Q,\{m\}}$ to $CQA_{Q,\{\kappa\}}$, where $\kappa$ is the key constraint
$A \rightarrow B$. □

Proposition 2 can be easily generalized to several relations
with one such MD defined on each. The reduction takes an
instance D for $RA_{Q,\{m\}}$ and produces an instance D' for
$CQA_{Q,\{\kappa\}}$. The schema of D' is the same as that of D, but the
extensions of the relational predicates in it are changed wrt
D via counting. Since definitions for those aggregations can
be included (or inserted) in the query Q, we obtain:

Theorem 6. Let S be a database schema with relation
predicates $R_i$, $1 \le i \le n$, with a set $\mathcal{K}$ of KCs $\kappa_i : R_i[A_i] \rightarrow R_i[B_i]$,
$1 \le i \le n$. Let Q be an FO query, and suppose there
exists a polynomial time computable FO query Q', such that
Q' returns the consistent answers to Q from D. Then there
exists a polynomial time computable FO query Q'' with aggregation
that returns the resolved answers to Q from D wrt
the MDs $m_i : R_i[A_i] = R_i[A_i] \rightarrow R_i[B_i] \rightleftharpoons R_i[B_i]$, $1 \le i \le n$. □

The aggregation in Q'' in Theorem 6 arises from the transformation
of the instance that is used in the reduction in
Proposition 2. We emphasize that Q'' is not obtained using
Algorithm Rewrite from Section 5, which is not guaranteed
to work for queries outside the class ucajCQ. Rather, a
first-order transformation of the $R_i$ relations with Count is
composed with Q' to produce Q''. As in Algorithm
Rewrite in Section 5, the Count expressions are used to express the most
frequently occurring values for the changeable attributes for a
given set of tuples with identical values for the unchangeable
attributes.

This theorem can be applied to decide/compute resolved
answers through composition in those cases where an FO
rewriting for CQA has been identified. In consequence, it
extends the tractable cases identified in Section 5, since it can
be applied to queries that are not in ucajCQ.

Example 17. The query $Q : \exists x \exists y \exists z (R(x, y) \wedge S(y, z))$
is in the class $C_{forest}$ for relational predicates R[A, B] and
S[C, E] and FDs $A \rightarrow B$ and $C \rightarrow E$. By Theorem 6 and
the results in [18], this implies the existence of a polynomial
time computable FO query with counting that returns the
resolved answers to Q wrt MDs $R[A] = R[A] \rightarrow R[B] \rightleftharpoons R[B]$
and $S[C] = S[C] \rightarrow S[E] \rightleftharpoons S[E]$. Notice that Q is
not in ucajCQ, since the bound variable y is associated with
the changeable attribute R[B]. □



7. CONCLUSIONS 

In this paper we have proposed a revised semantics for 
matching dependency (MD) satisfaction wrt the one origi- 
nally proposed in [15]. The main outcomes from that seman- 
tics are the notions of minimally resolved instance (MRI) 
and resolved answers (RAs) to queries. The former capture 
the intended, clean instances obtained after enforcing the 
MDs on a given instance. The latter are query answers that 
persist across all the MRIs, and can be considered as robust 
and semantically correct answers. 

We investigated the new semantics, the MRIs and the 
RAs. We considered the existence of MRIs, their number, 
and the cost of computing them. Depending on syntactic 
criteria on MDs and queries, tractable and intractable cases 
of resolved query answering were identified. The tractable 
cases coincide with those where the original query can be 
rewritten into a new, polynomial-time evaluable query that 
returns the resolved answers when posed to the original in- 
stance. It is interesting that the rewritings make use of 
counting and recursion (for the transitive closure). The 
original queries considered in this paper are all conjunctive. 
Other classes of queries will be considered in future work. 

Many of our results apply to cases for which the resolved
instances can be obtained after a single (batch) update operation.
The investigation of cases requiring multiple updates
is a subject of ongoing research. We have obtained several
tractability and intractability results. However, understanding
the complexity landscape still requires much more
research.

We established interesting connections between resolved 
query answering wrt MDs and consistent query answers. 
There are still many issues to explore in this direction, e.g.
the possible use of logic programs with stable model semantics
to specify the MRIs, as has been done with database
repairs [4, 5, 19].

We have proposed some efficient algorithms for resolved 
query answering. Implementing them and experimentation 
are also left for future work. Notice that those algorithms 
use different forms of transitive closure. To avoid unaccept- 
ably slow query processing, it may be necessary to compute 
transitive closures off-line and store them. The use of Dat- 
alog with aggregate functions should also be investigated in 
this direction. 

In this paper we have not considered cases where the 
matchings of attribute values, whenever prescribed by the 
MDs' conditions, are made according to matching functions. 
This element adds an entirely new dimension to the seman- 
tics and the problems investigated here. It certainly deserves 
investigation. 
APPENDIX

A. AUXILIARY RESULTS AND PROOFS 

Proof of Theorem 1: Consider an undirected graph G
whose vertices are labelled by pairs (t, A), where t is a tuple
identifier and A is an attribute of t. There is an edge
between two vertices (s, A) and (t, B) iff s and t satisfy the
similarity condition of some MD $m \in M$ such that A and B
are matched by m.

Update D as follows. Choose a vertex $(t_1, A)$ such that
there is another vertex $(t_2, B)$ connected to $(t_1, A)$ by an
edge and $t_1[A]$ and $t_2[B]$ must be made equal to satisfy the
equalities in condition 1. of Definition 3. For convenience
in this proof, we say that $t_2$ is unequal to $t_1$ for such a pair
of tuples $t_1$ and $t_2$. Perform a breadth first search (BFS)
on G starting with $(t_1, A)$ as level 0. During the search, if a
tuple is discovered at level i + 1 that is unequal to an adjacent
tuple at level i, the value of the attribute in the former
tuple is modified so that it matches that of the latter tuple.
When the BFS has completed, another vertex with an adjacent
unequal tuple is chosen and another BFS is performed.
This continues until no such vertices remain. It is clear that
the resulting updated instance D' satisfies condition 1. of
Definition 3.
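
The update procedure just described can be sketched as follows. This is an illustration only: it assumes the graph G is already given as an edge list over (tuple identifier, attribute) vertices, it treats every pair of adjacent vertices as required to be equal, and it omits the modifiability check handled in the next paragraph of the proof. The name `enforce_matches` is hypothetical.

```python
from collections import deque


def enforce_matches(values, edges):
    """Propagate values by repeated BFS until adjacent vertices agree.

    values: dict mapping (tuple_id, attribute) -> value
    edges:  iterable of pairs of such vertices
    Returns an updated copy of `values`.
    """
    adj = {v: set() for v in values}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)

    new = dict(values)

    def has_unequal_neighbour(u):
        return any(new[u] != new[w] for w in adj[u])

    while True:
        # Choose a start vertex with an adjacent unequal vertex.
        start = next((u for u in sorted(new) if has_unequal_neighbour(u)), None)
        if start is None:
            return new
        seen, queue = {start}, deque([start])
        while queue:
            u = queue.popleft()
            for w in sorted(adj[u]):
                if w not in seen:
                    seen.add(w)
                    new[w] = new[u]  # level i+1 takes the value of level i
                    queue.append(w)


values = {("t1", "A"): 1, ("t2", "A"): 2, ("t3", "A"): 3, ("t4", "B"): 9}
edges = [(("t1", "A"), ("t2", "A")), (("t2", "A"), ("t3", "A"))]
print(enforce_matches(values, edges))
```

After each search the connected component of the start vertex carries a single value, so the number of components containing unequal neighbours strictly decreases and the loop terminates.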

We now show by induction on the levels of the breadth first
searches that for all vertices (t, A) visited, t[A] is modifiable.
This is true in the base case, by choice of the starting vertex.
Suppose it is true for all levels up to and including the i-th
level. By definition of the graph G and condition 2. of
Definition 2, the statement is true for all vertices at the (i + 1)-th
level. This proves the first statement of the theorem.

To prove the second statement, we show that, to satisfy 
condition 1. of Definition 3, the attribute values represented 
by each vertex in each connected component of G must be 
changed to a common value in the new instance. The state- 
ment then follows from the fact that the update algorithm 
can be modified so that the attribute value for the initial 
vertex in each BFS is updated to some arbitrary value at 
the start (since it is modifiable). By condition 1. of Defini- 
tion 3, the pairs of values that must be equal in the updated 
instance D' correspond to those vertices that are connected 
by an edge in G. This fact and transitivity of equality imply 
that all attribute values in a connected component must be 
updated to a common value. □ 

Proof of Theorem 2: We give an algorithm to compute a
resolved instance, and use a monotonicity property to show
that it always terminates. For attribute domain d in D,
consider the set $S^d$ of pairs (t, A) such that attribute A of
the tuple with identifier t has domain d. Let $\{S_1, S_2, \ldots, S_n\}$
be a partition of $S^d$ into sets such that all tuple/attribute
pairs in a set have the same value in D. Define the level of
(t, A) to mean $|S_j|$ where $(t, A) \in S_j$.

The algorithm first applies all MDs in M to D by setting 
equal pairs of unequal values according to the MDs. Specifi- 
cally, consider a connected component C of the graph in the 
proof of Theorem 1. If the values of t[A] for all pairs (t, A) 



in C are not all the same, then their values are modified to 
a common value which is that of the pair with the highest 
level. This update is allowed by Theorem 1. In the case 
of a tie, the common value is chosen as the largest of the 
values according to some total ordering of the values from 
the domain that occur in the instance. It is easily verified 
that this operation increases the sum over all the levels of 
the elements of $S^d$, where d is the domain of the attributes
of the pairs in C. These updates produce an instance $D_1$
such that $(D, D_1) \models M$.

The MDs of M are then applied to the instance $D_1$ to
obtain a new instance $D_2$ such that $(D_1, D_2) \models M$ and so
on, until a stable instance is reached. For each new instance,
the sum over all domains d of the levels of the $(t, A) \in S^d$ is
greater than for the previous instance. Since this quantity
is bounded above, the algorithm terminates with a resolved
instance. □

For the proof of Proposition 1, we need an auxiliary result. 

Lemma 1. Let D be an instance and let m be the MD in
Definition 12. Let T be the transitive closure of $\approx$. An instance
D' obtained by changing modifiable attribute values
of D satisfies $(D, D') \models m$ iff for each equivalence class of T,
there is a constant vector $\bar{v}$ such that, for all tuples t in the
equivalence class,

$t'[\bar{C}] = \bar{v}$ if $t \in R(D)$

$t'[\bar{E}] = \bar{v}$ if $t \in S(D)$

where t' is the tuple in D' with the same identifier as t.

Proof: Suppose $(D, D') \models m$. By Definition 3, for each pair
of tuples $t_1 \in R(D)$ and $t_2 \in S(D)$ such that $t_1[\bar{A}] \approx t_2[\bar{B}]$,

$t'_1[\bar{C}] = t'_2[\bar{E}]$

Therefore, if $T(t_1, t_2)$ is true, then $t'_1$ and $t'_2$ must be in
the transitive closure of the binary relation expressed by
$t'_1[\bar{C}] = t'_2[\bar{E}]$. But the transitive closure of this relation is
the relation itself (because of the transitivity of equality).
Therefore, $t'_1[\bar{C}] = t'_2[\bar{E}]$. The converse is trivial. □

Proof of Proposition 1: Consider an input D, M to ComputeMRI
with M a simple-cycle set of MDs given by

$R[A_0] \approx_0 S[B_0] \rightarrow R[A'_0] \rightleftharpoons S[B'_0]$
$R[A_1] \approx_1 S[B_1] \rightarrow R[A'_1] \rightleftharpoons S[B'_1]$
$\vdots$
$R[A_{n-1}] \approx_{n-1} S[B_{n-1}] \rightarrow R[A'_{n-1}] \rightleftharpoons S[B'_{n-1}]$

Let $T_j$ denote the transitive closure of the relation $\approx_j$. Let
$D_i$ denote an instance obtained by updating D i times according
to M, and for a tuple $t \in D$, denote the tuple with
the same identifier in $D_i$ by $t^i$. By Lemma 1 and straightforward
induction, it can be seen that, after D has been
updated i times, $i \ge 1$,² according to M to obtain an instance
$D_i$, for all tuples t in a given equivalence class E of
$T_j$,

$t^i[A'_{(j+i-1) \bmod n}] = \bar{v}^E_{ij}$ if $t \in R(D)$ (6)

$t^i[B'_{(j+i-1) \bmod n}] = \bar{v}^E_{ij}$ if $t \in S(D)$ (7)

for some vector of values $\bar{v}^E_{ij}$. Let D' be a resolved instance.
D' satisfies the property that any number of applications of
the MDs does not change the instance. Therefore, D' must
satisfy (6) and (7) for all i. That is, for any $T_j$, $1 \le j \le n$, for
any equivalence class E of $T_j$, for all tuples t in the equivalence
class, and for $1 \le i \le n$,

$t'[A'_i] = \bar{v}^E_{ij}$ if $t \in R(D)$ (8)

$t'[B'_i] = \bar{v}^E_{ij}$ if $t \in S(D)$ (9)

for some vector of values $\bar{v}^E_{ij}$, where t' is the tuple in D' with
the same identifier as t.

² We use the term "update" even if a resolved instance is
obtained after fewer than i modifications. In this case, the
"update" is the identity mapping on all values.

Let T be the transitive closure of the set $\{T_j \mid 1 \le j \le n\}$
(cf. Definition 13). By (8) and (9), for any pair of tuples $t_1$
and $t_2$ satisfying $T(t_1, t_2)$, $t'_1$ and $t'_2$ must satisfy $T'(t'_1, t'_2)$,
where T' is the transitive closure of the binary relation on
tuples expressed by $t'_1[A'_i] = t'_2[B'_i]$, $1 \le i \le n$. Since the
equality relation is closed under transitive closure, this implies
the following property:

$T(t_1, t_2)$ implies $t'_1[A'_i] = t'_2[B'_i]$, $1 \le i \le n$ (10)



It remains to show that the instances produced by ComputeMRI
are resolved instances. That they are the MRIs will
then follow from the fact that they have the fewest changes
among all instances satisfying (10). For any equivalence
class E of T, let $\bar{v}^E$ be a list of values chosen by ComputeMRI
as the common values for the pair of attribute lists
$(A'_i, B'_i)$ for tuples in E. To obtain the instance output by
ComputeMRI for this choice of values, D can be updated as
follows. For the i-th update, if the values of the attributes
$A'_i$ and $B'_i$ must be modified to achieve (6) and (7), take
$\bar{v}^E_{ij} = \bar{v}^{E'}$, where E' is the equivalence class of T that contains
the equivalence class E of $T_j$. Note that such an E'
always exists, and the assignment of values is consistent since
overlapping equivalence classes of $T_i$ and $T_j$ will be contained
in the same equivalence class of T. Then after n updates,
the resulting instance satisfies (10), with common values as
chosen by ComputeMRI.

We must show that the resolved instance produced by
this update process is the same instance that ComputeMRI
returns for the given choice of update values. For any intermediate
instance I obtained in this update process, let $t_I$
denote the tuple in I with the same identifier as t. We will
show by induction on the number of updates that were made
to obtain I that for any i, whenever $T_i(t_I, t'_I)$ for tuples t
and t', it holds that T(t, t'). This implies that updates made
to t[A] for tuple t and attribute A can only set it equal to
the common value for the equivalence class of T to which t
belongs. Since ComputeMRI also sets t[A] to this value, this
will prove the proposition.

By definition of T, if no updates were used to obtain I,
$T_i(t_I, t'_I)$ implies $T_i(t, t')$, which implies T(t, t'). Assume it is true
for instances obtained after at most k updates. Let I be an
instance obtained after k + 1 updates. Suppose for the sake
of contradiction that there exist tuples $t_I$ and $t'_I$ such that
for some i, $T_i(t_I, t'_I)$ but $\neg T(t, t')$. Since $\neg T(t, t')$ implies
$\neg T_i(t, t')$, at least one of $t[A'_i]$ and $t'[B'_i]$ was updated so that
$T_i(t_I, t'_I)$. We will assume that only $t[A'_i]$ was updated. The
other cases are similar. Then it must have been updated to
$t''[A'_i]$ or $t''[B'_i]$ for some $t'' \in R$ or $t'' \in S$, respectively, such
that, for the instance I' on which the update was performed,
it holds that $T_i(t_{I'}, t''_{I'})$ and $T_i(t'_{I'}, t''_{I'})$. By the induction
hypothesis, T(t, t'') and T(t', t''), which by the transitivity
of T implies T(t, t'), a contradiction. □

Proof of Theorem 3: If it can be verified in polynomial
time that an instance is an MRI of a given instance wrt a
set M of MDs, then $RA_{Q,M}$ is in coNP for any FO query Q. This
is because, for a given instance (D, t) of $RA_{Q,M}$, t can be
shown not to be a certain answer by guessing an instance
D', verifying that it is an MRI, and verifying that t is not
an answer to Q for D'. Algorithm ComputeMRI can easily
be modified to produce such a polynomial time verifier:
compute the transitive closure relation T but, instead of setting
values equal, check that they are equal in the candidate
MRI. □

Lemma 2. Let M be a non-interacting set of MDs of the
form

$m_1 : R_1[A_1] \approx_1 R_2[B_1] \rightarrow R_1[A_2] \rightleftharpoons R_2[B_2]$
$m_2 : R_3[A_3] \approx_2 R_4[B_3] \rightarrow R_3[A_4] \rightleftharpoons R_4[B_4]$
$\vdots$
$m_n : R_{2n-1}[A_{2n-1}] \approx_n R_{2n}[B_{2n-1}] \rightarrow R_{2n-1}[A_{2n}] \rightleftharpoons R_{2n}[B_{2n}]$

Let T be the transitive closure of the set $\{TA_1, TA_2, \ldots, TA_n\}$,
where $TA_i$ is the tuple-attribute closure of $m_i$. Then, for
any instance D, an instance D' obtained by updating modifiable
values of D is a resolved instance of D iff whenever
$T(t_1, A, t_2, B)$, $t'_1[A] = t'_2[B]$, where t' is the tuple in D' with
the same identifier as t.

Proof: Suppose D' is a resolved instance. Since M is non-interacting,
this implies $(D, D') \models M$. It is a corollary of
Lemma 1 that whenever $T(t_1, A, t_2, B)$, $t'_1[A] = t'_2[B]$, for
all $1 \le i \le n$. The converse follows from the fact that,
whenever a pair of tuples $t_1$ and $t_2$ satisfies the similarity
condition of an MD, $T(t_1, A, t_2, B)$ for every pair (A, B) of
matched attributes in the MD. □

Corollary 1. Let D be an instance and M a set of non- 
interacting MDs. Let T be the transitive closure of the set 
of tuple-attribute closures of the MDs in M. Then the set 
of MRIs is obtained by setting, for each equivalence class E 
of T, the value of each attribute in E to a value that occurs 
in E at least as frequently as any other value in E. □ 
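
Corollary 1 yields a direct way to enumerate the MRIs once the equivalence classes of T are known. The sketch below is illustrative and assumes the classes have already been computed and are passed in explicitly; the name `minimally_resolved_instances` and the dictionary encoding of an instance by (tuple identifier, attribute) pairs are assumptions made here.

```python
from collections import Counter
from itertools import product


def minimally_resolved_instances(values, classes):
    """Yield every MRI as a dict (tuple_id, attribute) -> value.

    values:  dict mapping (tuple_id, attribute) -> original value
    classes: list of lists of (tuple_id, attribute) pairs, the
             equivalence classes of T (assumed precomputed)
    """
    options = []
    for cls in classes:
        counts = Counter(values[p] for p in cls)
        top = max(counts.values())
        # Every value of maximal frequency is an admissible common value.
        options.append(sorted(v for v, c in counts.items() if c == top))

    for choice in product(*options):
        instance = dict(values)
        for cls, common in zip(classes, choice):
            for p in cls:
                instance[p] = common
        yield instance


values = {
    ("t1", "B"): "b1", ("t2", "B"): "b2", ("t3", "B"): "b2",
    ("t4", "B"): "b5", ("t5", "B"): "b6",
}
classes = [
    [("t1", "B"), ("t2", "B"), ("t3", "B")],  # b2 is the unique majority
    [("t4", "B"), ("t5", "B")],               # tie between b5 and b6
]
print(len(list(minimally_resolved_instances(values, classes))))
```

A class with a unique most frequent value contributes a single choice, whereas a tie multiplies the number of MRIs by the number of tied values.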

Proof of Theorem 4: We express the query in the form

$Q(\bar{y}) = \exists \bar{z}\, Q_1(\bar{z}, \bar{y})$ (11)

Let $x_{ij}$ denote the variable of $\bar{z}$ or $\bar{y}$ which holds the value of
the j-th attribute in the i-th conjunct $R_i$ in $Q_1$. Denote this
attribute by $A_{ij}$. Note that, since variables and conjuncts
can be repeated, it can happen that $x_{ij}$ is the same variable
as $x_{kl}$ for $(i, j) \ne (k, l)$, that $A_{ij}$ is the same attribute as $A_{kl}$
for $(i, j) \ne (k, l)$, or that $R_i$ is the same as $R_j$ for $i \ne j$. Let
B and F denote the sets of bound and free variables in $Q_1$,
respectively. Let C and U denote the variables in $Q_1$ holding
the values of changeable and unchangeable attributes,
respectively. Let $Q'(\bar{y})$ denote the rewritten query returned
by Algorithm Rewrite, which we express as

$Q'(\bar{y}) = \exists \bar{z}\, Q'_1(\bar{z}, \bar{y})$

We show that, for any constant vector $\bar{a}$, $Q'(\bar{a})$ is true for
an instance D iff $Q(\bar{a})$ is true for all MRIs of D.

Suppose that $Q'(\bar{a})$ is true for an instance D. Then there
exists a $\bar{b}$ such that $Q'_1(\bar{b}, \bar{a})$. We will refer to this assignment
of constants to variables as $A_{Q'_1}$. From the form of
Q', it is apparent that, for any fixed i, there is a tuple
$t_i = \bar{c}_i = (c_{i1}, c_{i2}, \ldots, c_{ip})$ such that $R_i(t_i)$ is true in D with
the following properties.

1. For all Xij except those in Ff|C, Cij is the value as- 
signed to Xij by Aqi. 

2. For all Xij £ F P| C, there is a tuple with attribute 
B such that T(ti, Aij,t%, B), where T is the transitive 
closure of the tuple-attribute closures of the MDs in M, 
such that the value of t? [B] is the value assigned to Xij 
by Aqi. Moreover, this value occurs more frequently 
than that of any other tuple/attribute pair in the same 
equivalence class of T. 

For any given MRI $D'$, consider the tuple $t_i'$ in $D'$ with the
same identifier as $t_i$. Clearly, this tuple will have the same
values as $t_i$ for all unchangeable attributes, which, by 1., are
the values assigned to the variables $x_{ij} \in U$. Also, by 2.
and Corollary 1, for any $j$ such that $x_{ij} \in F \cap C$,
the value of the $j$th attribute of $t_i'$ is that assigned to $x_{ij}$ by
$A_{Q'}$.

Thus, for each MRI $D'$, there exists an assignment $A_Q$ of
constants to the $x_{ij}$ that makes $Q$ true, and this assignment
agrees with $A_{Q'}$ on all $x_{ij} \notin B \cap C$. This assignment is
consistent in the sense that, if $x_{ij}$ and $x_{kl}$ are the same
variable, they are assigned the same value. Indeed, for $x_{ij} \notin B \cap C$,
consistency follows from the consistency of $A_{Q'}$, and
for $x_{ij} \in B \cap C$, it follows from the fact that the variable
represented by $x_{ij}$ occurs only once in $Q$, by assumption.
Therefore, $Q(a)$ is true for all MRIs $D'$, and $a$ is a resolved
answer.

Conversely, suppose that a tuple $a$ is a resolved answer.
Then, for any given MRI $D'$ there is a satisfying assignment
$A_Q$ to the variables in $Q$ such that $y$ as defined by (11) is
assigned the value $a$. We write $Q'$ in the form

$$Q'(y) \leftarrow \exists z \bigwedge_{i} Q_i'(v_i) \qquad (12)$$

with $Q_i'$ the rewritten form of the $i$th conjunct of $Q$. For any
fixed $i$, let $t' = (c_{i1}', c_{i2}', \ldots, c_{ip}')$ be a tuple in $D'$ such that $c_{ij}'$
is the constant assigned to $x_{ij}$ by $A_Q$.

We construct a satisfying assignment $A_{Q'}$ to the free and
existentially quantified variables of $Q'$ as follows. Consider
the conjunct $Q_i'$ of $Q'$ as given on line 16 of Rewrite. Assign
to $v_i$ the tuple $t$ in $D$ with the same identifier as $t'$. This
fixes the values of all the variables except those $x_{ij} \in F \cap C$,
which are set to $c_{ij}'$. It follows from Corollary 1 that $A_{Q'}$
satisfies $Q'$. Since $A_Q$ and $A_{Q'}$ match on all variables that
are not local to a single $Q_i'$, $A_{Q'}$ is consistent. Therefore, $a$
is an answer for $Q'$ on $D$. □

Proof of Theorem 5: Hardness follows from the fact that,
for the instance $D$ resulting from the reduction in the proof
of Theorem 3.3 in [12], the set of all repairs of $D$ with
respect to the given key constraint is the same as the set of
MRIs with respect to (5). The key point is that attribute
modification in this case generates duplicates, which are
subsequently eliminated from the instance, producing the same
result as tuple deletion. Containment follows from Theorem
3. □

Proof of Proposition 2: Take $A = (A_1, \ldots, A_m)$ and $B =
(B_1, \ldots, B_n)$. For any tuple of constants $k$, define $R^k =
\sigma_{A=k} R$. Let $B_i^k$ denote the single-attribute relation with
attribute $B_i$ whose tuples are the most frequently occurring
values in $\pi_{B_i} R^k$. That is, $a \in B_i^k$ iff $a \in \pi_{B_i} R^k$ and there
is no $b \in \pi_{B_i} R^k$ such that $b$ occurs as the value of the $B_i$
attribute in more tuples of $R^k$ than $a$ does. Note that $B_i^k$ can
be written as an expression involving $R$ which is first order
with a Count operator. The reduction produces $(R', t)$ from
$(R, t)$, where

$$R' = \bigcup_k \left[ \pi_A R^k \times B_1^k \times \cdots \times B_n^k \right] \qquad (13)$$



The repairs of $R'$ are obtained by keeping, for each set of
tuples with the same key value, a single tuple with that key
value and discarding all others. By Corollary 1, in an MRI
of $D$, the group $G_k$ of tuples such that $A = k$ for some
constant $k$ has a common value for $B$ also, and the set of
possible values for $B$ is the same as that of the tuple with
key $k$ in a repair of $D$. Since duplicates are eliminated from
the MRIs, the set of MRIs of $D$ is exactly the set of repairs
of $R'$. □
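
The construction of $R'$ in (13) can be sketched in Python as follows. This is only an illustration under stated assumptions, not the paper's implementation: the relation is a list of Python tuples whose attributes are exactly those of $A$ and $B$, the positions of the key and non-key attributes are passed as index lists, and the names `reduce_relation`, `key_idx`, and `b_idx` are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import product

def reduce_relation(R, key_idx, b_idx):
    """Build R' as the union over key values k of {k} x B_1^k x ... x B_n^k."""
    groups = defaultdict(list)  # R^k: the tuples of R whose key attributes equal k
    for t in R:
        groups[tuple(t[i] for i in key_idx)].append(t)
    result = set()
    for k, tuples in groups.items():
        per_attr = []
        for i in b_idx:
            counts = Counter(t[i] for t in tuples)
            top = max(counts.values())
            # B_i^k: the values of attribute B_i that occur most often in R^k
            per_attr.append([v for v, c in counts.items() if c == top])
        for combo in product(*per_attr):
            result.add(k + combo)
    return result

# Toy relation (assumed data) with key attribute at position 0 and B = (B_1, B_2).
R = [("k1", "x", "p"), ("k1", "x", "q"), ("k1", "y", "q"),
     ("k2", "u", "p"), ("k2", "v", "p")]
R_prime = reduce_relation(R, key_idx=[0], b_idx=[1, 2])
```

For key `k1` each non-key attribute has a unique most frequent value, so a single tuple is produced. For key `k2` the values `u` and `v` tie, so $R'$ contains two tuples with that key, and each repair of $R'$ keeps exactly one of them.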



Proof of Theorem 6: $Q'$ is obtained by composing $Q'$
with the transformation $R \rightarrow R'$, which is a first-order query
with aggregation.



